Boardgame Rating Prediction
Posted on Sun 23 September 2018 in Machine Learning
Predict the Rating for Board Games¶
The data set contains 80000 board games, with information about each game and its associated review scores. I'm going to predict average_rating using the other columns.
In [2]:
import pandas as pd
board_games = pd.read_csv("board_games.csv")
board_games.head()
Cleaning¶
In [3]:
board_games.dropna(axis=0, inplace = True)
board_games = board_games[board_games['users_rated'] > 0]
Data Exploration¶
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(board_games['average_rating'])
plt.show()
plt.boxplot(board_games['average_rating'])
plt.show()
std = board_games['average_rating'].std()
mean = board_games['average_rating'].mean()
print(std)
print(mean)
Error Metric¶
The ratings roughly follow a normal distribution, so mean squared error is a reasonable error metric.
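As a rough illustration of this metric, here is a minimal sketch on made-up numbers (not the board game data). The helper name `rmse` and the sample ratings are assumptions for illustration only. It also shows a handy reference point: always predicting the mean rating gives an RMSE equal to the population standard deviation of the ratings.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings, for illustration only
ratings = [6.0, 7.5, 5.0, 8.0, 6.5]

# Baseline "model": predict the mean rating for every game
mean_rating = statistics.mean(ratings)
baseline_rmse = rmse(ratings, [mean_rating] * len(ratings))

# Population standard deviation (divides by n, not n - 1)
population_std = statistics.pstdev(ratings)

print(baseline_rmse, population_std)
```

Note that pandas' `Series.std()` divides by n - 1 by default, so on a small sample it differs slightly from this baseline; on tens of thousands of rows the difference is negligible.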
Clustering¶
In [5]:
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters = 5, random_state=1)
numeric_columns = board_games.iloc[:,3:]
kmeans_model.fit(numeric_columns)
labels = kmeans_model.labels_
import numpy
game_mean = numeric_columns.apply(numpy.mean, axis=1)
game_std = numeric_columns.apply(numpy.std, axis=1)
plt.scatter(x = game_mean, y = game_std, c = labels)
plt.show()
It looks like most of the games are similar: 4 clusters lie between mean = 0 and mean = 4000.
Finding Correlations¶
Remove columns that don't add predictive power to the model.
In [6]:
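# Keep only numeric columns: recent pandas versions raise an error if corr() meets text columns such as 'name'
correlations = board_games.select_dtypes(include="number").corr()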
print(correlations['average_rating'])
- The 'yearpublished' column is, surprisingly, positively correlated with average_rating, so more recent games tend to be rated more highly.
- The higher the 'minage', the higher the rating tends to be.
- The "weightier" a game is (its complexity rating), the more highly it tends to be rated.
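To make these numbers concrete: `DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below computes it by hand on made-up values (the `pearson` helper and the sample lists are assumptions for illustration, not taken from the data set). Values near +1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up complexity weights and ratings, for illustration only
weights = [1.0, 2.0, 3.0, 4.0]
ratings = [5.0, 6.0, 6.5, 8.0]

r = pearson(weights, ratings)
print(r)
```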
In [25]:
cols = list(board_games.columns)
cols.remove("average_rating")
cols.remove("bayes_average_rating")
cols.remove("minplayers")
cols.remove("maxplayers")
# identifier and text columns
cols.remove("name")
cols.remove("id")
cols.remove("type")
I removed columns that are not useful as predictors, such as 'bayes_average_rating', which is derived from 'average_rating'.
Linear Regression¶
In [28]:
from sklearn.linear_model import LinearRegression
# Training
lr = LinearRegression()
lr.fit(board_games[cols], board_games["average_rating"])
# Prediction
predictions = lr.predict(board_games[cols])
from sklearn.metrics import mean_squared_error
import math
mse = mean_squared_error(board_games['average_rating'], predictions)
rmse = math.sqrt(mse)
print(rmse)
The RMSE is close to the standard deviation (1.57) of all board game ratings. This indicates that our model may not have high predictive power.
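One caveat: the model above is scored on the same rows it was fitted on, so the reported error may be optimistic. A common alternative is to hold out part of the data for evaluation. The sketch below shows the idea with scikit-learn's `train_test_split` on synthetic data; the generated features, the coefficients and the 80/20 split are assumptions for illustration, not the board game data or the method used in this post.

```python
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: 200 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

# Score on rows the model has never seen
test_rmse = math.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(test_rmse)
```

With the real data, the same pattern would apply to `board_games[cols]` and `board_games["average_rating"]`; the held-out RMSE can then be compared with the standard deviation of the held-out ratings.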